Rapid Building of an ASR System for Under-Resourced Languages Based on Multilingual Unsupervised Training

Authors

  • Ngoc Thang Vu
  • Franziska Kraus
  • Tanja Schultz
Abstract

This paper presents our work on rapid language adaptation of acoustic models based on multilingual cross-language bootstrapping and unsupervised training. We used Automatic Speech Recognition (ASR) systems in six source languages (English, French, German, Spanish, Bulgarian and Polish) to build an ASR system from scratch for Vietnamese, an under-resourced language. System building was performed without any transcribed audio data by applying three consecutive steps: cross-language transfer, unsupervised training based on the “multilingual A-stabil” confidence score [1], and bootstrapping. We investigated the correlation between the performance of “multilingual A-stabil” and the number of source languages, and improved the performance of “multilingual A-stabil” by applying it at the syllable level. Furthermore, we showed that increasing the number of source-language ASR systems in the multilingual framework results in better performance of the final ASR system in the target language, Vietnamese. The final Vietnamese recognition system has a Syllable Error Rate (SyllER) of 16.8% on the development set and 16.1% on the evaluation set.
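
The Syllable Error Rate quoted above can be illustrated with a minimal sketch. It assumes the usual error-rate definition (edit distance between reference and hypothesis, divided by the reference length) and assumes syllables are separated by whitespace, as in written Vietnamese. The function name `syllable_error_rate` and the example strings are hypothetical and are not taken from the paper, whose actual scoring setup is not described here.

```python
def syllable_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over whitespace-separated syllables,
    normalised by the number of reference syllables (assumed definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one syllable")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis syllables.
    prev = list(range(len(hyp) + 1))
    for i, ref_syl in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference syllables
        for j, hyp_syl in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_syl == hyp_syl else 1)
            deletion = prev[j] + 1       # reference syllable missing in hypothesis
            insertion = curr[j - 1] + 1  # extra syllable in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference syllables.
    print(syllable_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, the rate can exceed 1.0 when the hypothesis contains many insertions.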

Similar resources

Improving Under-Resourced Language ASR Through Latent Subword Unit Space Discovery

Development of state-of-the-art automatic speech recognition (ASR) systems requires acoustic resources (i.e., transcribed speech) as well as lexical resources (i.e., phonetic lexicons). It has been shown that acoustic and lexical resource constraints can be overcome by first training an acoustic model that captures acoustic-to-multilingual phone relationships on language-independent data; and th...

A first LVCSR system for Luxembourgish, an under-resourced European language

Luxembourgish is embedded in a multilingual context on the divide between Romance and Germanic cultures and remains one of Europe’s under-described languages. We describe our efforts in building a large vocabulary ASR system for such a “minority” language (target language: Luxembourgish) without any transcribed audio training data. Instead, acoustic models are derived from major languages (sou...

Unsupervised acoustic model training using multiple seed ASR systems

Unsupervised acoustic modeling can offer a cost and time effective way of creating a solid acoustic model for any under-resourced language. This paper explores the novel idea of using two independent ASR systems to transcribe new speech data, align and filter the ASR hypotheses and use the presumably correct transcriptions to iteratively improve the two seed ASR systems. In parallel, the newly ...

Combination of multilingual and semi-supervised training for under-resourced languages

Multilingual training of neural networks for ASR is widely studied these days. It has been shown that languages with little training data can benefit largely from the multilingual resources for training. The use of unlabeled data for the neural network training in semi-supervised manner has also improved the ASR system performance. Here, the combination of both methods is presented. First, mult...

Speech data collection in an under-resourced language within a multilingual context

In this paper, we present an end-to-end solution to the development of an automatic speech recognition (ASR) system in typical under-resourced languages, where the target language is likely to be influenced by one or more embedded foreign languages. We first describe the collection and processing of the text corpus crawled from the World Wide Web using the Rapid Language Adaptation Toolkit. In par...

Publication date: 2011